Rejoinder: Efficiency and Structure in MNIR 

Matt Taddy, The University of Chicago Booth School of Business 



I thank Profs. Blei and Grimmer for their comments; it is great to have one's work discussed
by researchers who are both excellent statisticians and experts in their respective fields.

The discussion can be summarized under two themes. Prof. Blei is interested in extending 
MNIR to modeling additional, often latent, structure in text. Prof. Grimmer is concerned with 
causation and interpretability. Both will be answered in the context of my original motivation for
MNIR: the estimation efficiency derived from assumptions on $x \mid y$. We'll begin with estimator
properties in a simple illustration, then turn to discussion of latent factors and causal inference.

1 Efficiency



A related question of efficiency has been studied by Efron (1975) and Ng and Jordan (2002)
in comparisons between logistic regression and 'generative' discriminant analysis. Efron's
generative classifier applies Bayes rule to inverse multivariate normals $x \mid y \sim N(\mu_y, \Sigma)$, where
$\mu_y = E[x \mid y]$ varies with $y \in \{0, 1\}$ but the covariance matrix is shared across populations.
Given true normal covariate distributions separated by root Mahalanobis distances of 3 to 4, he
finds predictions from this routine to be 1.5 to 3 times more efficient than logistic regression.
This efficiency gain is smaller than that found by Ng and Jordan for a Naive Bayes algorithm
(each covariate is fit as independent of the others given y), with their results loosely interpreted
to imply log(n) times higher efficiency for the generative predictor. Although Naive Bayes
independence is not assumed for the data itself, requirements on the amount of information
about y available in each covariate have the effect of limiting conditional dependence.

Our model presents a third scenario: covariate dependence is fully specified via the negative
correlation of a multinomial. Consider binary response $y \in \{0, 1\}$ and the joint word-sentiment
distribution $p(x, y) = \mathrm{MN}(x \mid q(y))\, p(y)$ where $q_j(y) = \exp[\alpha_j + \varphi_j y] / \sum_l \exp[\alpha_l + \varphi_l y]$ -
that is, the collapsed model in Equation 1 of the main paper. Then the expected information for
$\varphi$ is $\pi W$, where $\pi = E[y]$ and $W = \mathrm{diag}(q_1) - q_1 q_1'$ with $q_1 = q(y = 1)$, and standard results
(e.g., van der Vaart, 1998, chap. 5) imply that in a fixed vocabulary the variance for maximum
likelihood estimator $\hat\varphi$ scales inversely with $M = \sum_i \sum_j x_{ij}$, the total number of words.



Proposition 1.1. Assume the above joint model for y and x with $\pi > 0$, and write $\hat\varphi$ for the
MLE fit of $\varphi$ in our collapsed MNIR model. The estimation error converges in distribution as

$$\sqrt{M}\,(\hat\varphi - \varphi) \rightsquigarrow N(0, W^{-1}).$$

Thus variance decreases with the amount of speech rather than with the number of speakers.
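
The scaling in Proposition 1.1 can be checked numerically in the smallest possible case. The sketch below is an illustration only, not part of the original analysis: it assumes a two-word vocabulary, for which the collapsed MLE of the single free loading has a closed form (a difference of logits), and the word probabilities, the half-count smoothing, and all function names are hypothetical choices. It estimates the Monte Carlo variance of the fitted loading at several total word counts M.

```python
import math
import random


def simulate_phi_hat(total_words, q0, q1, share_y1, rng):
    """Draw collapsed counts for a two-word vocabulary and return the MLE of phi.

    With the first word as reference, the collapsed MLE is
    logit(x_1 / m_1) - logit(x_0 / m_0), where x_y counts occurrences of the
    second word among the m_y words spoken by documents with sentiment y.
    """
    m1 = int(total_words * share_y1)
    m0 = total_words - m1
    x0 = sum(rng.random() < q0 for _ in range(m0))
    x1 = sum(rng.random() < q1 for _ in range(m1))

    def logit(x, m):
        # Half-count smoothing (an assumption of this sketch) avoids log(0).
        return math.log((x + 0.5) / (m - x + 0.5))

    return logit(x1, m1) - logit(x0, m0)


def mc_variance(total_words, reps=200, seed=0):
    """Monte Carlo variance of the fitted loading at a given total word count."""
    rng = random.Random(seed)
    draws = [simulate_phi_hat(total_words, 0.3, 0.5, 0.5, rng) for _ in range(reps)]
    mean = sum(draws) / reps
    return sum((d - mean) ** 2 for d in draws) / (reps - 1)


# Variance should fall roughly in proportion to 1/M, however the M words
# happen to be split across documents.
variances = {M: mc_variance(M) for M in (200, 800, 3200)}
```

Because only the collapsed totals enter the estimator, the number of documents never appears in the simulation, which is the point of the proposition.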

Prediction requires an accompanying forward model. If the collapsed model holds true,
Bayes rule implies a forward predictor and results of Proposition 1.1 apply directly. A more
realistic scenario has the collapsed model misspecified on an individual level. Consider a model
of individual heterogeneity such that $x \perp\!\!\!\perp y \mid x'\varphi, u$ where $\varphi$ can be estimated consistently as
in Proposition 1.1 and $u$ is a vector of unobserved random effects - for example, the model of
Section 3.3 with $x_{ij} \sim \mathrm{Po}(\exp[\mu_j + \varphi_j y_i + u_{ij}])$ and $y_i \perp\!\!\!\perp u_{ij} \sim N(0, 1)$. Write $z = \varphi'f =
\varphi'(x/m - \frac{1}{n}\sum_i x_i/m_i)$ for projection of mean shifted frequencies $F = [f_1 \cdots f_n]'$, and say
MNIR-OLS is the two-stage estimation of $\varphi$ in collapsed MNIR and $[\alpha, \beta]$ given $\hat z = F\hat\varphi$ via
least-squares (OLS). Consider the simple forward approximation $E[y \mid f, u] = \alpha + \beta z$ (e.g., if
$y = \tilde\alpha + \tilde\beta z + \gamma'u + \varepsilon$ and $u_j = d_j + b_j z + \nu_j$ with $\nu_j \perp\!\!\!\perp z$, then $\beta = \tilde\beta + \gamma'b$). Iterated expectation
implies $E[\arg\min_b \sum_i (y_i - \alpha - f_i'b)^2] = E[\hat\varphi\hat\beta] = \varphi\beta$, such that OLS and MNIR-OLS have
the same expectation and the effect of u on z is subsumed in $\beta$.
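
The two-stage MNIR-OLS procedure described above can be sketched in a few lines for binary y. This is a minimal illustration under stated assumptions, not the estimator used in the main article: the collapsed model is fit without regularization (for binary y it is saturated, so the loadings are log ratios of group word frequencies), a pseudo-count `smooth` is added to avoid taking the log of zero, and every function name is hypothetical.

```python
import math


def collapsed_loadings(counts, y, smooth=0.5):
    """Closed-form collapsed MLE of phi for binary y (saturated model).

    counts: list of per-document word-count lists; y: list of 0/1 labels.
    `smooth` is a pseudo-count added for this sketch only, to avoid log(0).
    """
    p = len(counts[0])
    totals = {0: [smooth] * p, 1: [smooth] * p}
    for x, label in zip(counts, y):
        for j, c in enumerate(x):
            totals[label][j] += c
    m0, m1 = sum(totals[0]), sum(totals[1])
    # phi_j = log q_j(1) - log q_j(0); a constant shift across j is harmless
    # because mean-shifted frequencies sum to zero over the vocabulary.
    return [math.log(totals[1][j] / m1) - math.log(totals[0][j] / m0) for j in range(p)]


def fit_mnir_ols(counts, y):
    """Stage 1: collapsed loadings. Stage 2: OLS of y on the projection z."""
    phi = collapsed_loadings(counts, y)
    freqs = [[c / sum(x) for c in x] for x in counts]
    n, p = len(freqs), len(phi)
    fbar = [sum(f[j] for f in freqs) / n for j in range(p)]
    z = [sum(phi[j] * (f[j] - fbar[j]) for j in range(p)) for f in freqs]
    # z is centred by construction, so OLS reduces to these two formulas.
    alpha = sum(y) / n
    beta = sum(zi * yi for zi, yi in zip(z, y)) / sum(zi * zi for zi in z)
    return phi, fbar, alpha, beta


def predict(model, x):
    """Forward prediction alpha + beta * z for a new count vector x."""
    phi, fbar, alpha, beta = model
    m = sum(x)
    z = sum(phi[j] * (x[j] / m - fbar[j]) for j in range(len(phi)))
    return alpha + beta * z


# Toy corpus: word 0 is typical of y = 1, word 1 of y = 0, word 2 is neutral.
toy_counts = [[8, 1, 1], [7, 2, 1], [1, 8, 1], [2, 7, 1]]
toy_y = [1, 1, 0, 0]
model = fit_mnir_ols(toy_counts, toy_y)
pred_pos = predict(model, [9, 1, 0])
pred_neg = predict(model, [1, 9, 0])
```

The forward regression here has a single covariate regardless of vocabulary size, which is where the precision gain discussed next comes from.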

The distinction of MNIR-OLS is its estimation precision. 

Proposition 1.2. Consider data from the joint word-sentiment distribution of Proposition 1.1,
partitioned into documents $\{x_i, y_i\}_{i=1}^n$ where $0 < \sum_i y_i < n$. Assuming a finite upper-bound
for each $|\varphi_j|$, the MNIR-OLS predictor $\hat y(x)$ for a new document x has

$$\mathrm{var}(\hat y(x)) \xrightarrow{\,M \to \infty\,} \sigma^2 \left[ \frac{1}{n} + \frac{z^2}{\sum_{i=1}^n z_i^2} \right]$$

where $z = f'\varphi$ is the true projection for x and $\sigma^2$ is residual variance for regression of y on z.

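Proof. Note $\bar z = 0$ and $\mathrm{var}(\hat y(x)) = \mathrm{var}(\hat\alpha) + f'\,\mathrm{var}(\hat\varphi\hat\beta_{\hat z})\,f$ where $\hat\beta_{\hat z}$ is OLS slope on $\hat z = F\hat\varphi$.
From Proposition 1.1 and the continuous mapping theorem we have $\hat\varphi \to \varphi$ and $\hat\beta_{\hat z} \rightsquigarrow \hat\beta_z$.
Slutsky's lemma yields $\hat\varphi\hat\beta_{\hat z} \rightsquigarrow \varphi\hat\beta_z$ with variance $\varphi\,\mathrm{var}(\hat\beta_z)\,\varphi' = \sigma^2 \varphi\varphi' / \sum_i z_i^2$. Given that
$\hat\varphi \mapsto \hat\varphi\hat\beta_{\hat z}$ is bounded on its finite domain, the Portmanteau lemma implies our convergence.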

Thus, in our simple cartoon, MNIR-OLS approaches with number-of-words the error rate
of univariate least-squares. This holds for infill (where n is constant but speech-per-document
grows) as well as when n is growing with M and the right-hand side in Proposition 1.2 is decreasing.
Regularized estimation, say as applied in the main article, should help efficiency in tougher
setups (e.g., where vocabulary grows with M) but will increase bias. Although we've focused
on linear models many other options are available - for example, tree methods (e.g., Breiman,
2001) work well in low dimensions for nonlinearity and variable interaction. The principles
remain the same: results like Proposition 1.1 show efficiency in collapsed IR, and one hopes to
be able to account for individual-level misspecification in the low dimensional forward model.
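
For reference, the univariate least-squares benchmark mentioned above is the textbook variance of a fitted mean, $\sigma^2 [1/n + (z - \bar z)^2 / \sum_i (z_i - \bar z)^2]$. The short sketch below evaluates it; the function name and the toy inputs are illustrative assumptions, and the residual variance is taken as known.

```python
def ols_mean_variance(z_new, z_train, sigma2):
    """Variance of the fitted mean at z_new for univariate OLS with intercept."""
    n = len(z_train)
    zbar = sum(z_train) / n
    sxx = sum((zi - zbar) ** 2 for zi in z_train)
    return sigma2 * (1.0 / n + (z_new - zbar) ** 2 / sxx)


# With n held fixed (infill) this benchmark stays put as documents get longer;
# it shrinks only when more documents, and hence more z_i, are added.
v_two_docs = ols_mean_variance(1.0, [-1.0, 1.0], sigma2=1.0)
v_four_docs = ols_mean_variance(1.0, [-1.0, 1.0, -1.0, 1.0], sigma2=1.0)
```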

2 Latent factors 

Prof. Blei's 2nd extension is an especially promising idea. Random effects were originally 
viewed as a nuisance necessary for understanding misspecification. However, a low-dimensional 
latent factorization of these effects would be a powerful tool for exploration and prediction. It 
provides a middle ground between LDA and MNIR. 

Such a model has log-odds $\eta = \alpha + \Phi y + \Gamma u$ where $u = [u_1 \ldots u_K]'$ is a K-dimensional
factor vector. $\Gamma$ can then be interpreted as logit-transformed LDA topics for variation in text
not explained by variables in y. Just as $\Phi'x$ is sufficient for y, the topic projection $\Gamma'x$ will be
sufficient for latent factors. Therefore the model provides both a new way to think about latent 
structure in text and a strategy for fast computation of topic weights. 
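
To make the structure of this latent factor model concrete, the sketch below maps log-odds of the form intercept plus sentiment loadings plus factor loadings to multinomial word probabilities, and computes the two projections of a count vector onto the loading matrices. It is a hedged illustration: the loadings are made-up numbers, the function names are hypothetical, and nothing here addresses how the loadings or factors would be estimated.

```python
import math


def word_probabilities(alpha, Phi, Gamma, y, u):
    """Multinomial word probabilities from log-odds alpha + Phi y + Gamma u.

    alpha: length-p intercepts; Phi: p rows of loadings on y;
    Gamma: p rows of loadings on the K latent factors u.
    """
    eta = [
        a
        + sum(ph * yv for ph, yv in zip(phi_j, y))
        + sum(g * uv for g, uv in zip(gamma_j, u))
        for a, phi_j, gamma_j in zip(alpha, Phi, Gamma)
    ]
    top = max(eta)  # subtract the max for numerical stability
    w = [math.exp(e - top) for e in eta]
    total = sum(w)
    return [wi / total for wi in w]


def projections(Phi, Gamma, x):
    """Projections Phi'x (for y) and Gamma'x (for the latent factors)."""
    p = len(x)
    sr_y = [sum(Phi[j][d] * x[j] for j in range(p)) for d in range(len(Phi[0]))]
    sr_u = [sum(Gamma[j][k] * x[j] for j in range(p)) for k in range(len(Gamma[0]))]
    return sr_y, sr_u


# Three words, one sentiment variable, one latent factor (all values made up).
alpha = [0.0, 0.0, 0.0]
Phi = [[1.0], [-1.0], [0.0]]
Gamma = [[0.5], [0.5], [-1.0]]
q = word_probabilities(alpha, Phi, Gamma, y=[1.0], u=[0.0])
sr_y, sr_u = projections(Phi, Gamma, x=[2, 1, 3])
```

Once loadings are available, each projection costs a single pass over the document's counts, which is what makes the topic weights cheap to compute.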

The difficulty with latent factor modeling is estimation. On the one hand, although the
model is more complex, estimation variance should still decrease with M because of the
multinomial assumption on x (indeed, similar arguments can explain the solid performance of LDA
and sLDA regression). However, there are two big computational issues in posterior
maximization with document-specific $\Gamma u_i$: you can no longer collapse the likelihood, and you need to
jointly solve for $\Gamma$ and $U = [u_1 \ldots u_n]'$. Since the discussants and I work on corpora many
orders larger than the examples in this article, additional latent structure is only useful if we
can devise scalable algorithms for its estimation.

On the lack of collapsibility, which is also an issue for high-dimensional y, I have had
success applying a MapReduce strategy (Dean and Ghemawat, 2004). A factorized likelihood
is obtained by assuming counts $x_{ij}$ and $x_{ik}$ for $j \neq k$ are independent and Poisson distributed
given $y_i$ and $u_i$ (centered on intensity $\exp(m_i/p)$ for convenience). The Map step groups counts
on each column of X (i.e., for each word) and the Reduce step is a (possibly zero-inflated)
Poisson log regression of each word count onto $y_i$ and $u_i$. Exponential family parametrization
of the Poisson allows the same sufficiency results, and the multinomial distribution for vectors 
of independent Poissons given their sum implies a close connection to MNIR. A paper on this 
approach to distributed multinomial regression is under preparation. 
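
The Map and Reduce steps described above can be sketched serially as follows. This is an illustration under several assumptions and is not the implementation referred to in the text: it runs in a single process rather than on a cluster, it regresses on a scalar y only (latent factors would enter as additional covariates), it omits zero inflation and regularization, it uses a fixed log offset of log(m_i / p) per document as the centering, and it assumes every word appears in both sentiment groups so that the Newton iterations are well defined. All function names are hypothetical.

```python
import math
from collections import defaultdict


def map_step(X):
    """Group counts by word: word index j -> list of (document index i, count)."""
    by_word = defaultdict(list)
    for i, row in enumerate(X):
        for j, count in enumerate(row):
            by_word[j].append((i, count))
    return by_word


def reduce_step(entries, y, offsets, iters=50):
    """Poisson log regression of one word's counts on y, by Newton's method.

    Fits count_i ~ Po(exp(offset_i + a + b * y_i)) and returns (a, b).
    """
    total = sum(c for _, c in entries)
    exposure = sum(math.exp(offsets[i]) for i, _ in entries)
    a, b = math.log(max(total, 0.5) / exposure), 0.0  # intercept-only start
    for _ in range(iters):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for i, c in entries:
            mu = math.exp(offsets[i] + a + b * y[i])
            g0 += c - mu                 # score for the intercept
            g1 += (c - mu) * y[i]        # score for the slope
            h00 += mu                    # negative Hessian entries
            h01 += mu * y[i]
            h11 += mu * y[i] * y[i]
        det = h00 * h11 - h01 * h01
        a += (h11 * g0 - h01 * g1) / det
        b += (h00 * g1 - h01 * g0) / det
    return a, b


# Toy corpus: 4 documents, 2 words; the first two documents have y = 1.
X = [[5, 1], [7, 2], [2, 6], [1, 9]]
y = [1, 1, 0, 0]
p = len(X[0])
offsets = [math.log(sum(row) / p) for row in X]

# Each reduce call touches one word only, so the calls could run in parallel.
fits = {j: reduce_step(entries, y, offsets) for j, entries in map_step(X).items()}
```

Because each word's regression depends on the rest of the vocabulary only through the fixed offsets, the per-word fits are independent tasks, which is what allows the work to be distributed.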



Even with these parallel algorithms, it is difficult to solve for both $U$ and $\Gamma$. A fixed-point
solver (iterating between maximization for each conditional on the other) is usually too slow.
One could impute a rough guess for U (e.g., from a PCA of document tf-idf), but this is only
a stand-in solution. Recent advances in distributed optimization using ADMM (Boyd et al.,
2010) may offer a way forward, iterating from unique $U_j$ for each jth word towards shared
U across vocabulary, but this is just conjecture. The problem of latent factor MNIR for large
corpora remains unsolved. I look forward to further discussion with Prof. Blei on this because 
it is something that his lab, if anybody, has a good chance of tackling. 

3 Interpretability 

Prof. Grimmer's comments are focused on interpretability: the translation from estimated
models to scientific mechanisms. In particular, he and other social scientists are interested in
questions of causation. This is among the toughest of topics in statistics, and one that is only
growing in both difficulty and importance with the amount and dimension of our data.

First, we should not underestimate the importance of predictive ability in causal modeling. 
The goal is always good prediction, but to understand causation we want a model that predicts 
well when one covariate changes and all others stay constant. Some of the best causal inference 
schemes are explicitly predictive: matching, treatment-effects models, and propensity scores 
rely upon estimation of the rate at which treated individuals were assigned to that group. As 
an example, colleagues and I are interested in measuring attribution for digital advertisements 
(i.e., how an ad causes changes in consumer behavior). This is a notoriously tough problem, 
since the fact that a consumer sees an ad is highly correlated with the likelihood that they were 
already looking to buy a certain product. MNIR for a consumer's text (e.g., on social media)
and their browser history (where website counts are treated like word counts) can be used to 
efficiently predict the probabilities both that they see an ad and that they buy a product, and we 
hope to use this to disentangle these correlated outcomes. 

However, instead of using text to help control for unobserved variables, Prof. Grimmer is 
seeking methods to infer the mechanisms behind word choice. This is because he rightly wants 
to ensure that word loadings correspond to a general notion of partisanship - one that is portable 
between, say, newspapers and congressional speech. This is the causal problem exploded to 
simultaneous inference for thousands of correlated outputs. Regardless, MNIR is a natural 
starting point: I assume that 'sentiment' causes speech rather than the inverse. From this one 
can look to apply the structural models used in econometrics and biostatistics. As mentioned,
the effects of other inputs are 'controlled for' by including them in the log-odds, say as $\eta = \alpha +
\varphi y + \Theta v$ where $v$ is the vector of confounding variables. Going further, an MNIR treatment
effects estimator would regress y on v and include the fitted expectation in the equation for
$\eta$. One needs to be careful here, as techniques used for efficiency in high dimensions, such
as sparse regularization, can bias inference in unexpected ways. See Belloni et al. (2012) for
recent work on sparse high-dimensional treatment effects estimation.

Finally, we should be aware of the limits of frameworks like MNIR (this also relates to Prof. 
Blei's 3rd extension). As Prof. Grimmer says, it is difficult to know what covariates should be 
included or excluded from the model. However, this will always be as much of a problem in 
text analysis as it has long been in social science. The 'what' that we measure is only ever 
defined in terms of observables and the model assumed around them (even with human coders,
sentiment is dictated by the questions we ask). The goal is to have this be as close as possible
to our abstract ideal. For example, an ongoing project at Booth is investigating the history of 
partisanship in congressional speech. To define partisanship, we look at average predictability 
of party identity given words drawn from the distribution of speech for a given party. The 
question of partisanship has been transformed to one of predictability, and this notion is refined 
by controlling for causes of word choice (e.g., geography, race) that we understand as distinct 
from partisanship. It is healthy to keep this inference separate from abstract meanings for 
sentiment or partisanship, in order to be clear on where evidence ends and speculation begins. 

Thanks to Jesse Shapiro, Matt Gentzkow, and Christian Hansen for helpful discussion. 